February 1, 2016

Interdisciplinary Group

  • Computer Science
  • Applied Statistics and Scientific Computation
  • Computational Biology

Central theme

  • Computational and statistical method and tool development
  • Study the molecular basis of variation in development and disease
  • Using high-throughput experimental methods

Metagenomics

  • DNA sequencing of bacterial communities
  • Method and tool development: metagenomeSeq Bioconductor package
  • Large epidemiological study of childhood diarrhea in developing countries

Host-pathogen interaction

  • Joint RNA-sequencing of parasite and host cell during course of infection
  • Novel methodology for RNA-seq normalization and differential expression analysis

Computational Epigenomics

NHGRI strategic plan

"The major bottleneck in genome sequencing is no longer data generation—the computational challenges around data analysis, display and integration are now rate limiting. New approaches and methods are required to meet these challenges."

  • Data analysis
  • Visualization
  • Data integration
  • Computational tools and infrastructure
[Nature, 2011]

Computational Epigenomics

Epigenomics and DNA methylation

Genes are expressed differently during different stages and in different tissues.

DNA is packed, making certain parts inaccessible, and this packing is dynamic.

DNA methylation is a chemical modification of DNA, involved in gene expression regulation.

[Robertson and Wolffe, Nat Rev Genet, 2000]

Probing DNA methylation

The data

DNA methylation in cancer

Large blocks of hypo-methylation in colon cancer

Nat. Genetics, 2011
  • overlaps with other important genomic domains
  • tissue-specific genes are over-represented within blocks

Hypo-methylation blocks observed across five solid tumor types.

Genome Medicine, 2014

This is a heterogeneous cell population

Methylation pattern reconstruction problem

  • Given a set of mapped reads
  • Determine composition of cell-specific methylation patterns

The statistic: number of reads in genomic region

  • regions (vertices) defined by overlapping reads with consistent methylation patterns
  • overlaps (edges) defined by overlapping reads with inconsistent methylation patterns
  • region coverage: total number of reads originating in region

The model: expected number of reads in genomic region

\[ \mathbb{E} y_v = \sum_{u:(v,u) \in E} \ell_{vu} \sum_{p:(v,u)\in p} \theta_p \]
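The model can be made concrete with a small sketch. It assumes each candidate pattern is a path of region names through the graph, that `theta` holds one abundance per path, and that `ell` is a per-edge weight whose exact definition the slides do not give. The toy graph, the node names, and the function name `expected_coverage` are all hypothetical.

```python
def expected_coverage(paths, theta, ell):
    """Return {v: E[y_v]} following the slide's model.

    paths: list of tuples of region names (one tuple per pattern p)
    theta: list of abundances, theta[i] belongs to paths[i]
    ell:   dict mapping an edge (v, u) to its weight l_vu (assumed given)
    """
    # Inner sum: total abundance of all paths that use edge (v, u).
    edge_abundance = {}
    for path, abundance in zip(paths, theta):
        for edge in zip(path, path[1:]):
            edge_abundance[edge] = edge_abundance.get(edge, 0.0) + abundance

    # Outer sum: weight each outgoing edge of v by l_vu and add up.
    expected = {}
    for (v, u), abundance in edge_abundance.items():
        expected[v] = expected.get(v, 0.0) + ell[(v, u)] * abundance
    return expected


# Hypothetical toy graph with two patterns sharing the regions "s" and "a".
paths = [("s", "a", "b", "t"), ("s", "a", "c", "t")]
theta = [3.0, 1.0]
ell = {("s", "a"): 1.0, ("a", "b"): 0.5, ("a", "c"): 2.0,
       ("b", "t"): 1.0, ("c", "t"): 1.0}

expected = expected_coverage(paths, theta, ell)
print(expected)
```

Region "a" feeds both patterns, so its expected coverage mixes the two abundances through the two edge weights.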

The estimator

  • Penalized method of moments:
  • number of parameters = number of paths through graph
  • sparsity inducing penalty to obtain solution with small number of patterns

\[ \min_{\theta_p} \sum_v \lvert y_v - \sum_{u:(v,u)\in E} \ell_{vu} \sum_{p:(v,u)\in p} \theta_p \rvert + \lambda \sum_p \lvert \theta_p \rvert \]

How to solve efficiently

\[ \min_{\theta_p} \sum_v \lvert y_v - \sum_{u:(v,u)\in E} \ell_{vu} \sum_{p:(v,u)\in p} \theta_p \rvert + \lambda \sum_p \lvert \theta_p \rvert \]

If we interpret abundance as path flow, then we can rewrite in terms of edge flows

\[ f_{vu} = \sum_{p:(v,u) \in p} \theta_p \]

\[ \begin{aligned} \min_{f \geq 0} \; & \sum_v \lvert y_v - \sum_{u:(v,u)\in E} \ell_{vu} f_{vu} \rvert + \lambda f_{vt} \\ \textrm{s.t.} \; & \sum_{u:(v,u) \in E} f_{vu} = \sum_{w:(w,v) \in E} f_{wv} \end{aligned} \]
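A short check of why the path and edge formulations agree, under stated assumptions: the graph has a single source `"s"` and a single sink `"t"` (hypothetical names), conservation is enforced at every other region, and the penalty term is read as the total flow entering the sink. With non-negative abundances, that total equals the sum of the path abundances.

```python
def path_to_edge_flows(paths, theta):
    """f_vu = sum of theta_p over all paths p that contain edge (v, u)."""
    flow = {}
    for path, abundance in zip(paths, theta):
        for edge in zip(path, path[1:]):
            flow[edge] = flow.get(edge, 0.0) + abundance
    return flow


def is_conserved(flow, source="s", sink="t", tol=1e-9):
    """Check inflow == outflow at every node except source and sink."""
    inflow, outflow = {}, {}
    for (v, u), f in flow.items():
        outflow[v] = outflow.get(v, 0.0) + f
        inflow[u] = inflow.get(u, 0.0) + f
    inner = (set(inflow) | set(outflow)) - {source, sink}
    return all(abs(inflow.get(v, 0.0) - outflow.get(v, 0.0)) <= tol
               for v in inner)


def flow_into(flow, sink="t"):
    """Total flow on edges that end at the sink."""
    return sum(f for (v, u), f in flow.items() if u == sink)


# Hypothetical toy example: three patterns through a small region graph.
paths = [("s", "a", "b", "t"), ("s", "a", "c", "t"), ("s", "d", "c", "t")]
theta = [3.0, 1.0, 2.0]

flow = path_to_edge_flows(paths, theta)
print(is_conserved(flow))            # edge flows built from paths are conserved
print(flow_into(flow), sum(theta))   # sink inflow equals total abundance
```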

  • This is a linear optimization problem (LP)
  • We can solve efficiently for very large problems
  • Final solution obtained by path flow decomposition
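The last step, path flow decomposition, can be sketched as follows. This is a generic greedy decomposition, not necessarily the procedure used in methylFlow. It assumes the edge flows already satisfy conservation on an acyclic graph with a hypothetical source `"s"` and sink `"t"`. Decompositions are generally not unique, so the tie-breaking rule here (alphabetical order) is an arbitrary choice.

```python
def decompose_flow(flow, source="s", sink="t", tol=1e-9):
    """Split conserved edge flows on a DAG into (path, abundance) pairs."""
    residual = {edge: f for edge, f in flow.items() if f > tol}
    patterns = []
    while True:
        # Walk from the source along edges that still carry flow.
        path = [source]
        while path[-1] != sink:
            v = path[-1]
            successors = sorted(u for (w, u) in residual if w == v)
            if not successors:
                break
            path.append(successors[0])
        if path[-1] != sink:
            break  # no flow left that reaches the sink

        # The path abundance is the smallest remaining flow along it.
        edges = list(zip(path, path[1:]))
        bottleneck = min(residual[e] for e in edges)
        patterns.append((tuple(path), bottleneck))

        # Subtract it and drop edges that are used up.
        for e in edges:
            residual[e] -= bottleneck
            if residual[e] <= tol:
                del residual[e]
    return patterns


# Hypothetical edge flows, as an LP solver might return them.
flow = {("s", "a"): 4.0, ("a", "b"): 3.0, ("b", "t"): 3.0,
        ("a", "c"): 1.0, ("c", "t"): 3.0,
        ("s", "d"): 2.0, ("d", "c"): 2.0}

patterns = decompose_flow(flow)
for path, abundance in patterns:
    print(" -> ".join(path), abundance)
```

Each iteration removes at least one edge, so the loop ends after at most as many paths as there are edges.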

Pattern reconstruction from whole genome bisulfite sequencing

Dataset of 50bp reads from mouse wild-type activated B cells, two types of progenitor cells (CLP and KSL).

Reconstruct patterns 4-100x basepair length

Reconstruct patterns with accurate marginal estimates

Pattern reconstruction from targeted bisulfite sequencing

Compare patterns across samples and populations

Moving Forward

  • We are now able to estimate methylation pattern composition for a single sample (e.g., normal or tumor)
  • How to detect differences between cell populations:
    • For paired normal-tumor data, find genomic regions where methylation pattern composition changes (significantly)?
    • For samples across developmental course, how does composition change as differentiation occurs?
  • How should we think about population-level inferences from measurements that are themselves inferred?

Cell-specific methylation pattern reconstruction

  • New view into molecular profiling
    • Complex relationship between cell-to-cell differences and average methylation differences
  • Efficient formulation as network flow problem
    • Exploring relationship to stochastic optimization
  • Code: https://github.com/hcorrada/methylFlow/
    • Includes incipient R package to parse and analyze resulting data
    • Takes SAM/BAM file input from Bismark and BSMap
    • Happy to collaborate on new analysis if you want to try it out
  • Why do we care about intra-tumor heterogeneity?

Hyper-variable expression

Genes with hyper-variable expression in colon cancer are enriched within these blocks.

Nat. Genetics, 2011

Gene expression hyper-variability enriched in hypo-methylation blocks in other cancer types.

Genome Medicine, 2014

Genes with consistent hyper-variable expression across tumors are tissue-specific.

BMC Bioinformatics, 2013

Summary

  • large domains of methylation loss are a stable mark across cancer types
  • gene expression hyper-variability is enriched within these domains
  • hyper-variable genes within these regions are tissue-specific and involved in cellular fate
  • determinants of expression variability in normal tissue [Nucleic Acids Research, 2013]

Gene expression anti-profiles

  • molecular methods for cancer detection, prognosis and treatment matching will be the basis of individualized medicine
  • gene expression profile methods have been a subject of study for decades
  • very few proposed predictors are translated to the clinic
  • one of the biggest culprits is lack of replicability of results in preliminary studies

anti-profile score: measures sample-specific deviation from normal expression in consistently hyper-variable genes

BMC Bioinformatics, 2013

  • Feature selection: top 100 genes with greatest hyper-variable expression in tumor:

\[ \log_2 \frac{\text{std. dev}_{\text{cancer}}}{\text{std. dev}_{\text{normal}}} \]

  • Range of normal expression:

\[ \mathrm{med} \, \text{normal expression}_g \pm 5 \times \mathrm{mad} \, \text{normal expression}_g \]

  • anti-profile score: number of genes in sample where expression is outside normal range
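The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the published implementation: expression values are plain per-gene lists, the MAD is left unscaled (statistical packages often multiply it by a consistency constant), and the toy data, gene names, and function names are hypothetical.

```python
import math
import statistics


def mad(values):
    """Unscaled median absolute deviation."""
    med = statistics.median(values)
    return statistics.median(abs(x - med) for x in values)


def select_hypervariable(normal, tumor, top=100):
    """Rank genes by log2(sd_cancer / sd_normal) and keep the top ones."""
    ratio = {
        g: math.log2(statistics.stdev(tumor[g]) / statistics.stdev(normal[g]))
        for g in normal
    }
    return sorted(ratio, key=ratio.get, reverse=True)[:top]


def normal_ranges(normal, genes, width=5.0):
    """Per-gene interval: median +/- width * MAD of normal expression."""
    ranges = {}
    for g in genes:
        med = statistics.median(normal[g])
        spread = width * mad(normal[g])
        ranges[g] = (med - spread, med + spread)
    return ranges


def anti_profile_score(sample, ranges):
    """Count selected genes whose expression falls outside the normal range."""
    return sum(1 for g, (lo, hi) in ranges.items()
               if not lo <= sample[g] <= hi)


# Hypothetical training data: gene -> expression values across samples.
normal = {"g1": [5.0, 5.2, 4.8, 5.1, 4.9],
          "g2": [7.0, 7.1, 6.9, 7.2, 6.8],
          "g3": [3.0, 3.5, 2.5, 3.2, 2.8]}
tumor = {"g1": [5.0, 9.0, 1.0, 7.5, 2.5],
         "g2": [7.0, 11.0, 3.0, 9.0, 5.0],
         "g3": [3.0, 3.4, 2.6, 3.1, 2.9]}

genes = select_hypervariable(normal, tumor, top=2)
ranges = normal_ranges(normal, genes)

tumor_like = {"g1": 8.3, "g2": 7.1, "g3": 3.0}
normal_like = {"g1": 5.0, "g2": 7.0, "g3": 3.0}
print(genes, anti_profile_score(tumor_like, ranges),
      anti_profile_score(normal_like, ranges))
```

Because the score is a count rather than a weighted sum, it depends only on the normal ranges and the selected gene set, which is one plausible reason for the cross-experiment stability reported on the following slides.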

Good cross-experiment properties
Stability in normal expression across experiments

BMC Bioinformatics, 2013

Prediction in leave-one-tissue out experiment

BMC Bioinformatics, 2013

Anti-profile score distinguishes between stages in tumor progression

Cancer Informatics, 2015

DNA methylation anti-profile score distinguishes between stages in tumor progression

Cancer Informatics, 2015

Stratification based on anti-profile score

Cancer Informatics, 2015

Stratification of breast samples based on anti-profile score

Cancer Informatics, 2015

Summary

  • Simple counting scheme produces robust, stable, and accurate (anti-)profiles
  • Nice prediction properties across experiments and across tissue types
  • Captures increasing hyper-variability associated with progression and prognosis

Anomaly Classification

  • Distinguish observations from two anomalous groups (e.g., adenoma vs. tumor)
  • How can we incorporate the fact that we are classifying anomalies?
  • Why (and when) is it worth doing that?
  • Using function approximation methods to study predictor stability

Summary

  • Profiles learned based on hyper-variability show consistent behavior across tissues and across experiments in tumor prognosis and progression
  • We can extend the general anti-profile idea to a function approximation setting
  • Use sensitivity-based cross-validation error bounds to characterize the effect of incorporating normal observations when classifying between anomalies
  • Incorporating normal samples when building anomaly predictors improves stability and prediction performance

Moving forward

  • better understand connection between intra-tumor heterogeneity and consistent hyper-variability in cancer
  • how to understand population-level inferences from features (measurements) that are themselves inferred
  • move anti-profiles closer to the clinic
  • explore anomaly classification as a general learning setting
  • methods to understand hierarchical organization of epigenomic domains

  • Discovery: consistent hypo-methylation, hyper-variability
  • Methods: anomaly classification as a setting to understand predictor stability, methylation pattern reconstruction
  • Tools

Moving forward

  • collaborative computational and visual analysis
  • effective visual methods to explore hierarchical organization of (epi)-genome
  • deeper integration of statistically-informed visualization
  • visualization-informed statistical analysis

  • Discoveries: consistent hypo-methylation, hyper-variability
  • Methods: anomaly classification as a setting to understand predictor stability
  • Tools: computational and visual exploratory genomic data analysis

Metagenomics (mixed genomes)

NHGRI strategic plan

"Meeting the computational challenges for genomics requires scientists with expertise in biology as well as in informatics, computer science, mathematics, statistics and/or engineering."

A new generation of investigators who are proficient in two or more of these fields must be trained and supported.

Courses taught and created:

  • Undergraduate Computational Biology: sequence analysis and beyond
  • Graduate Functional Genomics
  • Undergraduate Data Science

Acknowledgements

Past members of HCBravo group
now at Harvard, U. Chicago, Johns Hopkins, Genentech, Dow Jones Data Science

Colleagues at CBCB
Current members of HCBravo group
Collaborators at JHU/Harvard

Funding: NIH, Genentech, Gates Foundation

More information

http://hcbravo.org
@hcorrada